New pass Reduce variable liveness #3965
Conversation
…d-op closer to use-op. Add a test.
Signed-off-by: Maxime France-Pillois <[email protected]>
There is a loop sink pass in IGC. Can you please create an issue for the IGC team to investigate why it doesn't catch the case of FA with the shape that gives the most gain?
/// Create a prefetch operation for the given load operation.
static void createPrefetchOp(tt::LoadOp loadOp) {
  Operation *op = loadOp.getPtr().getDefiningOp();
When did we check that loadOp.getPtr() is defined by an operation? Do we need to add that check to isLoadCandidate? Or should we add support for the case where the pointer is a region argument?
Thanks for noticing. A check has been added to isLoadCandidate.
As the pass adds a prefetch right after the defining op, I'm concerned that adding this prefetch in another region (in the case where the load ptr has been defined in another region) could have side effects on the cache, as an early data fetch could mean evicting data that are still needed.
Do we care about the case where the pointer comes directly from a function argument?
It is good to have the reduce-variable-liveness pass as the beginning of liveness optimization in the Triton middle end. The optimization relies on the cache to hold the values that we may reuse in the loop, but the cache system is not fully controllable by the program. It would be better if we could enhance it with the use of shared local memory, making it somewhat like a RegisterToMem pass for the general case.
@mfrancepillois can you do a Triton Benchmark run with this PR to identify improvements (or degradations - hopefully none) in all the microbenchmarks we have?
Operation *forOp) {
  // Only pointer to tensor are considered to be moved
-  if (!mlir::triton::isTensorPointerType(loadOp.getPtr().getType()))
+  if (!mlir::triton::isTensorOrTensorPointerType(loadOp.getPtr().getType()))
[optional]
-  if (!mlir::triton::isTensorOrTensorPointerType(loadOp.getPtr().getType()))
+  if (!mlir::triton::isTensorPointerType(loadOp.getResult().getType()))
This would limit the optimization to block-pointer loads. That is conservative and I am OK with limiting the pass in this PR. Generally speaking, the pass should work for tensors of ptrs as well as block pointers.
The current pass does handle block pointers AND tensors of pointers (with the condition that the load has an empty mask).
After a few improvements to this pass (handling multiple users for the …), for flash-attention we have the following performance: … Other benchmarks do not seem to be significantly impacted by this pass.
// each "for loop" given that the liveness of variables may have changed
// as a result of the code, and specifically `LoadOps`, being modified
// by the pass.
Liveness livenessAnalysis(rootOperation);
To reduce compile time we should detect whether the pass made any changes to the code and only rerun the analysis if changes were made.
The code has been modified to run the analysis only when needed.
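The agreed-upon pattern (rebuild the liveness analysis only when the previous transformation actually changed the IR) can be sketched in plain C++, with no MLIR dependency; `Liveness` here is a stand-in for the real analysis class, and `changed[i]` models whether the pass modified loop i:

```cpp
#include <cassert>

// Stand-in for the real MLIR Liveness analysis; it only counts how many
// times it gets (re)built.
struct Liveness {
  static inline int constructions = 0;
  Liveness() { ++constructions; }
};

// Walk a sequence of loops, rebuilding the analysis only after an
// iteration that actually changed the IR. Returns the number of loops
// the transformation fired on.
int processLoops(const bool *changed, int numLoops) {
  Liveness analysis; // initial analysis over the root operation
  bool dirty = false;
  int transformed = 0;
  for (int i = 0; i < numLoops; ++i) {
    if (dirty) {
      analysis = Liveness(); // recompute only when the IR was modified
      dirty = false;
    }
    if (changed[i]) {
      dirty = true;
      ++transformed;
    }
  }
  return transformed;
}
```

With three loops where the first and last are transformed, the analysis is built twice (once up front, once after the first change) instead of once per loop.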
}

Operation *rootOperation = getOperation();
rootOperation->walk([&](scf::ForOp forOp) {
Ok, the pass for now only handles one kind of loop (scf.for). That is OK as a first cut; we might want/need to enhance it to also support while loops in the future.
A comment has been added to keep track of this.
Initial round of code review comments.
#define LARGE_TENSOR_SIZE_THRESHOLD_IN_BYTES \
  LARGE_TENSOR_MAJOR_SHAPE_THRESHOLD * LARGE_TENSOR_MINOR_SHAPE_THRESHOLD * 2

static unsigned getSizeInBytes(RankedTensorType &tensorType) {
Add documentation for this function and the next pls.
[nit] static is unnecessary because these utilities are in an anonymous namespace.
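For context, a utility of the kind under review computes a tensor's footprint from its shape and element bit-width. The real function takes a RankedTensorType; the explicit shape/bit-width parameters in this MLIR-free sketch are a simplification:

```cpp
#include <cstdint>
#include <vector>

// Footprint of a tensor in bytes: product of the dimensions times the
// element size. Mirrors the intent of getSizeInBytes(RankedTensorType&).
uint64_t getSizeInBytes(const std::vector<int64_t> &shape,
                        unsigned elemBitWidth) {
  uint64_t numElems = 1;
  for (int64_t dim : shape)
    numElems *= static_cast<uint64_t>(dim);
  return numElems * elemBitWidth / 8;
}
```

For example, the 64x256 f16 tensor in the test below occupies 64 * 256 * 2 = 32768 bytes, exactly the block-size threshold quoted later in this review.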
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
#include "triton/Dialect/TritonGPU/IR/Dialect.h"
#include "triton/Dialect/TritonGPU/Transforms/Passes.h"
#include "llvm/Support/Debug.h"
[nit] move in the section where other llvm include headers are "included".
#include "triton/Dialect/TritonGPU/Transforms/Passes.h"
#include "llvm/Support/Debug.h"

#include "intel/include/Analysis/Liveness.h"
[nit] let's try to keep include headers in their sections (all Intel headers together, all Triton upstream headers together, etc.)
namespace {

#define TOTAL_BLOCK_SIZE_THRESHOLD_IN_BYTES 32768
Suggest using C++ static constexpr instead of #defines.
The code has been updated this way.
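The suggested change, sketched on the thresholds visible in this diff. Note that the major/minor shape threshold values below are placeholders for illustration, not the values used in the PR:

```cpp
// Typed, scoped constants replacing the preprocessor #defines; constexpr
// gives type checking and avoids the precedence pitfalls of macro
// expressions like THRESHOLD_A * THRESHOLD_B * 2.
static constexpr unsigned TotalBlockSizeThresholdInBytes = 32768;
static constexpr unsigned LargeTensorMajorShapeThreshold = 128; // placeholder
static constexpr unsigned LargeTensorMinorShapeThreshold = 128; // placeholder
static constexpr unsigned LargeTensorSizeThresholdInBytes =
    LargeTensorMajorShapeThreshold * LargeTensorMinorShapeThreshold * 2;
```

Because the derived constant is an expression over named constants rather than a token-pasted macro, no parenthesization is needed at use sites.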
// The variable is considered as a long life span elected for being moved if:
// The live-in variables of the forOp consist in a large amount of bytes and
// The variable defined by `v` is a large tensor (with large amount of element
// in the minor dimenssion) and The variable liveness of `v` expends before
The -> the
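The heuristic described in the code comment above is a conjunction of three conditions; it can be sketched as a plain predicate (function name, parameters, and threshold values here are illustrative assumptions, not the PR's exact API):

```cpp
#include <cstdint>

// A variable is elected for being moved when (1) the loop's live-in set is
// already large, (2) the variable is a large tensor with a wide minor
// dimension, and (3) its liveness extends to before the loop.
bool isLongLifeSpanVariable(uint64_t loopLiveInBytes, uint64_t tensorBytes,
                            int64_t minorDimElems, bool liveBeforeLoop) {
  constexpr uint64_t kTotalBlockSizeThresholdInBytes = 32768; // from the pass
  constexpr uint64_t kLargeTensorSizeThresholdInBytes = 32768; // placeholder
  constexpr int64_t kLargeTensorMinorShapeThreshold = 128;     // placeholder
  return loopLiveInBytes >= kTotalBlockSizeThresholdInBytes &&
         tensorBytes >= kLargeTensorSizeThresholdInBytes &&
         minorDimElems >= kLargeTensorMinorShapeThreshold && liveBeforeLoop;
}
```

All three conditions must hold: a large tensor that is only live inside the loop, or a long-lived but small tensor, is left where it is.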
return false;

for (triton::DotOp dot : dotsInFor) {
  auto aVals = getLoad(dot.getA());
Use static types on LHS pls.
#dot1 = #ttg.dot_op<{opIdx = 1, parent = #dpas, kWidth=2}>
module attributes {ttig.support_sg_2d_block, "ttg.num-warps" = 32 : i32, "ttg.threads-per-warp" = 16 : i32} {
  tt.func public @matmul_kernel_small_tensor(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) {
    // CHECK-LABEL: tt.func public @matmul_kernel_small_tensor
remove tt.func public here
ttig.prefetch %1 {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32, isVolatile = false, operandSegmentSizes = array<i32: 1, 0, 0>} : !tt.ptr<tensor<64x256xf16, #dot1>>
%4:2 = scf.for %arg2 = %c0_i32 to %c64_i32 step %c64_i32 iter_args(%arg3 = %cst, %arg4 = %1) -> (tensor<16x256xf32, #dpas>, !tt.ptr<tensor<64x256xf16, #dot1>>) : i32 {
  // CHECK: scf.for
  // CHECK-NOT: tt.load {{.*}} : !tt.ptr<tensor<16x64xf16, #ttg.dot_op<{opIdx = 0, parent = #[[$DPAS]], kWidth = 1}>>>
OK, so this test checks that the load for operand A (opIdx==0) is not sunk into the loop. It would be helpful to add a COM to all the tests to briefly explain what each test is designed to cover.
Comments have been added to describe the goal of each test.
…ents and improve code quality. Signed-off-by: Maxime France-Pillois <[email protected]>
Add a new pass that reduces variable liveness by prefetching data and then moving the load op closer to its use op.